Introduction

In order to integrate a new feature of on/off target levels of coverage in NGSphy, I needed to understand coverage variation that one could find in a capture experiment. Goals are basically:

  1. Understand the coverage variation within samples and loci to be able to model it for both NGSphy parameterization and furthers simulations.
  2. Find out whether there is a correlation between the coverage and the phylogenetic distance to the reference species, used for the probe generation. Expecting that the closer the sample is to the reference the higher the coverage obtained is.

Data

Data used comes from Rute’s Hummingbird project. Most of the information reported here about the samples and targets was extracted from the paper.

Processing and storage

Workspace triploid (UVigo):

  • Original data from Dataset 1, is stored in: triploid.uvigo.es
  • Under the user folder: /home/merly/ research/cph-bam-coverage

Workspace randy (KU):

  • Original data from Dataset 1 and 2, is stored in: randy.popgen.dk
  • Under the user folder:
    • /home/merly/anna
    • /home/merly/swift

Targeted regions

Targeted regions are exons, retrieved a posteriori, due to the fact that the original probes for this project was lost during a flood.

WD="/media/merly/Baymax/research/cph-visit/coverage-analysis/files/"
birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
swiftgff=paste0(WD,"Chaetura_pelagica.CDS.gff")
annagff=paste0(WD,"Calypte_anna.gene.CDS.2750.gff")
birds=read.table(birds48filename, header=T,
    colClasses=c("character","numeric","character","character",
      "character","character","character"))
swift=read.table(swiftgff,
    colClasses=c("character","character","character","numeric",
      "numeric","character","character","character","character"))
anna=read.table(annagff,
    colClasses=c("character","character","character","numeric",
      "numeric","character","character","character","character"))

birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(anna$V9[birds.5$anna %in% anna$V9])
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
swiftGenes=birds.6$swift

swift.2=swift[swift$V3=="CDS",]
swift.3=swift.2[swift.2$V9 %in% birds.6$swift,]
swift.3$size=swift.3$V5-swift.3$V4
swift.4=swift.3[swift.3$size>0,]
anna.2=anna[(anna$V5-anna$V4)>0,]
swift.4=swift.4[,1:(ncol(swift.4)-1)]
write.table(swift.4[,c(1,4,5)],
  file=paste0(WD,"targets.swift.2.bed"),  
  row.names = F,col.names = F, quote = F, sep = "\t")
write.table(swift.4,
  file=paste0(WD,"Chaetura_pelagica.CDS.2.gff"),  
  row.names = F,col.names = F, quote = F, sep = "\t")
write.table(anna.2[,c(1,4,5)],
  file=paste0(WD,"targets.anna.2.bed"),
  row.names = F,col.names = F, quote = F, sep = "\t")
write.table(anna.2,
  file=paste0(WD,"Calypte_anna.gene.CDS.2750.2.gff"),
  row.names = F,col.names = F, quote = F, sep = "\t")
write.table(birds.6,
  file=paste0(WD,"Calypte_anna.Chaetura_pelagica.ids"),
  row.names = F,col.names = F, quote = F, sep = "\t")                                      

Description of the targeted regions

This is a quantitative summary description of the resulting target files.

Number of targets Size of targeted genome (bp) Number of genes Total of scaffolds
Anna 24,044 3,934,867 2,750 389
Swift 14,764 2,279,626 1,469 372

Pending questions:

  • Is the difference between number of genes, and subsequently in total “targeted” genenome, in both Anna and Swift, alarming?

Number of targeted regions per gene

Number of targeted regions per scaffold

  • Scaffolds: where genes are located.

Gene size distribution

Min. 1st Qu. Median Mean 3rd Qu. Max.
Anna 104 724 1,136 1,431 1,800 13,150
Swift 110 843 1,281 1,552 1,943 13,000

Targeted regions size distribution

Min. 1st Qu. Median Mean 3rd Qu. Max.
Anna 1 90 125 163.7 170 5,246
Swift 1 89 123 154.4 166 5,240

Pending questions:

  • These target size distribution are pretty much alike, but, there are some targeted regions up to ~5Kb, is this really the case? were these regions expected (in terms of size)?

On-target coverage

I obtained the coverage of the targeted regions using bedtools (v. 2.22.0), module coverage. With this, and the option -hist I can report a histogram of coverage for each feature in A as well as a summary histogram for _all_ features in A. Output (tab delimited) after each feature in A (from bedtools documentation))

  1. depth
  2. # bases at depth
  3. size of A
  4. % of A at depth
for bamfile in $(find $HOME/anna/originals -name "*.bam"); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b Chaetura_pelagica.CDS.2.gff | gzip > $HOME/anna/bedtools/cov/${tag}.cov.gz &
done
for bamfile in $(find $HOME/swift/originals -name "*.bam"); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b Calypte_anna.gene.CDS.2750.2.gff | gzip > $HOME/swift/bedtools/cov/${tag}.cov.gz &
done

And so, I filtered the output to keep the coverage per region and the summary histogram separated.

# (i.e)
for tag in $(cat $HOME/anna/files/samples.txt  ); do
  nohup zcat  $HOME/anna/bedtools/cov/H09.cov.gz | grep -v ^all | gzip >  $HOME/anna/bedtools/nohist/H09.nohist.gz &
  nohup zcat  $HOME/anna/bedtools/cov/H09.cov.gz | grep  ^all | gzip  >    $HOME/anna/bedtools/hist/H09.hist.gz &
done
for tag in $(cat $HOME/swift/files/samples.txt | tail -n+2 | head -46 ); do
  nohup zcat  $HOME/swift/bedtools/cov/H09.cov.gz | grep -v ^all | gzip >  $HOME/swift/bedtools/nohist/H09.nohist.gz &
  nohup zcat  $HOME/swift/bedtools/cov/H09.cov.gz | grep  ^all | gzip  >    $HOME/swift/bedtools/hist/H09.hist.gz &
done

Breadth vs. depth

With the information extracted from the bedtools coverage -hist, we can see the relation between the breadth of coverage obtained and the depth per sample.

Map2 - target Coverage per target overview
map2Anna
map2Swift

Zoom: 0x-100x.

map2Anna map2Swift

Depth of coverage

From the bedtools coverage output I was able to extract the depth of coverage per target, gene and sample. I generated 2 matrices:

  • Target depth: dimensions – number of targets \(\times\) number of samples
  • Gene depth: dimensions – number of genes \(\times\) number of samples

Each cell of the matrix correspond to the sum of the number of times each base was covered within the target or the gene. Then, coverage was calculated as the mean value of all the depth of coverage of all the samples for a specific target/gene divided by the size of the corresponding target/gene.

For the depth of coverage for the samples, I summed the coverage of all targets and divided it by the total amount of bases that were targeted.

Note: Outliers presented below were calculated using the following:

qnt <- quantile(data, probs=c(.25, .75))
H <- 1.5 * IQR(data)
y <- data
y[data < (qnt[1] - H)] <- "Outlier"
y[data > (qnt[2] + H)] <- "Outlier"

Coverage per sample

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 1.977 45.55 74.34 97.06 110.5 296.9
map2Swift 1.715 38.27 62.29 83.8 98.07 265.6

Coverage per gene

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 4.827 63.4 93.26 99.66 126.7 4,190
map2Swift 0.8245 54.24 81.06 85.55 111.5 702.6

Outlier check - map2Anna

  • Why there are some genes with extremely high depth of coverage?
  • Identify these outliers
# Get the outliers
qnt <- quantile(coveragePerGeneAnna, probs=c(.25, .75))
H <- 1.5 * IQR(coveragePerGeneAnna)
y <- rep("black",length(coveragePerGeneAnna))
outliers=coveragePerGeneAnna
outliers[coveragePerGeneAnna < (qnt[1] - H)] <- -1
outliers[coveragePerGeneAnna > (qnt[2] + H)] <- -1
outliers[outliers>=0]=0
outliers[outliers<0]=1
numOutliers=sum(outliers)
sizeOutlier=coveragePerGeneAnna[coveragePerGeneAnna>4000]
outlier4KIndex=which(coveragePerGeneAnna>4000)
outlier4K=unique(anna.2$V9)[outlier4KIndex]
  • There are 40, and coverage of those vary in the following way:
Min. 1st Qu. Median Mean 3rd Qu. Max.
222.5 233.7 251.9 358.9 280 4190

There is one gene with an extremely high coverage ( gene Parent=Aan_R000349; with coverage: 4,190.085), and these are the targets of the selected gene:

cat files/Calypte_anna.gene.CDS.2750.2.gff | grep "Parent=Aan_R000349;" | awk '{ print $0,"\t", $5-$4}'

Scaffold3026    GeneWise    CDS 3   230 .   +   0   Parent=Aan_R000349;      227

This might be because of the gene size, which has a size of 227 bp and only one target. In addition to that, this gene is present only in the Anna targets. I looked through the coverage of the samples, and selected the one with higher coverage, H1. Afterwards, went to the BAM file of that sample and looked this gene region (for position 3-320) with:

samtools mpileup H1_map2Anna_merge_pair.clean0.sort.bam -r Scaffold3026:3-230 

And the first 10 bp are shown up here, but most of the region look similar:

Scaffold3026    3       N       1408    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    4       N       1536    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Scaffold3026    5       N       1690    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    6       N       1848    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    7       N       2110    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    8       N       2274    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    9       N       2413    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Scaffold3026    10      N       2652    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Scaffold3026    11      N       2874    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Scaffold3026    12      N       3118    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
  • An so, it is possible that this high coverage is also due to a repetitive element region.

Outlier check - map2Swift

  • Why there are some genes with extremely high depth of coverage?
  • Identify these outliers
qnt <- quantile(coveragePerGeneSwift, probs=c(.25, .75))
H <- 1.5 * IQR(coveragePerGeneSwift)

outliersSwiftHigh=coveragePerGeneSwift
outliersSwiftHigh[coveragePerGeneSwift > (qnt[2] + H)] <- -1
outliersSwiftHigh[outliersSwiftHigh>=0]=0
outliersSwiftHigh[outliersSwiftHigh<0]=1
numOutliersHigh=sum(outliersSwiftHigh)
sizeOutlierHigh=coveragePerGeneSwift[coveragePerGeneSwift > (qnt[2] + H)]
outlierIndex=which(coveragePerGeneSwift > (qnt[2] + H))
outliersGenesSwiftHigh=unique(swift.4$V9)[outlierIndex]

outliersSwiftLow=coveragePerGeneSwift
outliersSwiftLow[coveragePerGeneSwift < (qnt[1] - H)] <- -1
outliersSwiftLow[outliersSwiftLow>=0]=0
outliersSwiftLow[outliersSwiftLow<0]=1
numOutliersLow=sum(outliersSwiftLow)
sizeOutlieLow=coveragePerGeneSwift[coveragePerGeneSwift < (qnt[1] - H)]
outlierIndex=which(coveragePerGeneSwift < (qnt[1] - H))
outliersGenesSwiftLow=unique(swift.4$V9)[outlierIndex]

These are the outliers with high and low coverage, and their coverage distribution. All of the outliers that exist, are outliers with high coverage.

Number of outliers Min. 1st. Qu. Median Mean 3rd. Qu Max.
High Coverage 19 199.1 207.3 215.7 246.4 234.3 702.6
Low Coverage 0 NA NA NA NaN NA NA

I’m interested in those genes with coverage higher than the 3rd. quartile of the outliers with high coverage.

Parent=Cpe_R001479; Parent=Cpe_R001695; Parent=Cpe_R003268; Parent=Cpe_R011470; Parent=Cpe_R013115;
Coverage 702.6021 265.057 261.7742 240.9443 250.3321
Size 794.0000 145.000 137.0000 349.0000 774.0000

I have these scaffolds containing these genes and the number of targets in them:

Scaffolds Genes Coverage Size Num.Targets
scaffold116 Parent=Cpe_R001479; Parent=Cpe_R001479; 794 7
scaffold12 Parent=Cpe_R001695; Parent=Cpe_R001695; 145 2
scaffold158 Parent=Cpe_R003268; Parent=Cpe_R003268; 137 1
scaffold53 Parent=Cpe_R011470; Parent=Cpe_R011470; 349 2
scaffold69 Parent=Cpe_R013115; Parent=Cpe_R013115; 774 6

Gene size vs. depth of coverage

I’m looking for a relation between gene size and depth of coverage. So I fit a linear model to the data.

## 
## Call:
## lm(formula = genesAnna$Size ~ coveragePerGeneAnna)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1365.3  -705.4  -291.1   361.1 11690.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1474.6592    31.3834  46.989   <2e-16 ***
## coveragePerGeneAnna   -0.4395     0.2314  -1.899   0.0577 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1116 on 2748 degrees of freedom
## Multiple R-squared:  0.001311,   Adjusted R-squared:  0.0009473 
## F-statistic: 3.606 on 1 and 2748 DF,  p-value: 0.05766

## Parent=Aan_R000349; 
##                  10
## 
## Call:
## lm(formula = genesAnna$Size[indices] ~ coveragePerGeneAnna[indices])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1397.1  -706.8  -292.7   352.4 11668.7 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1511.009     47.891  31.551   <2e-16 ***
## coveragePerGeneAnna[indices]   -0.812      0.437  -1.858   0.0633 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1116 on 2747 degrees of freedom
## Multiple R-squared:  0.001255,   Adjusted R-squared:  0.0008916 
## F-statistic: 3.452 on 1 and 2747 DF,  p-value: 0.06327

CHECK THIS OUT!

I need to understand how to interpret this

Coverage per target

Min. 1st Qu. Median Mean 3rd Qu. Max.
map2Anna 0 30.75 79.58 163 174.3 39,220
map2Swift 0 23.13 64.86 136.7 145 56,610

Check

  • There are some targets with extremely high depth of coverage. Reasons?
  • Identify these outliers

Target size vs. depth of coverage

I’m looking for a relation between target size and depth of coverage. So I fit a linear model to the data.

## 
## Call:
## lm(formula = targets$Size ~ coveragePerTargetAnna)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -167.5  -72.7  -39.2    4.2 5077.5 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           168.451377   1.354405  124.37   <2e-16 ***
## coveragePerTargetAnna  -0.029431   0.002159  -13.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.8 on 24042 degrees of freedom
## Multiple R-squared:  0.007669,   Adjusted R-squared:  0.007628 
## F-statistic: 185.8 on 1 and 24042 DF,  p-value: < 2.2e-16

CHECK THIS OUT!

I need to understand how to interpret this

Coverage distribution per target

Min. 1st Qu. Median Mean 3rd Qu. Max.
Anna 0 30.75 79.58 163 174.3 39,220
Swift 0 23.13 64.86 136.7 145 56,610

Overview

While there is no specific color key for the coverage values, this will only serve as an overview of the coverage per target/gene per sample.

  • Yellow maximum coverage
  • Red minimum coverage.
  • 100 Genes for each dataset
map2Anna map2Swift
map2Anna map2Swift
  • 100 target regions for each dataset
map2Anna map2Swift
map2Anna map2Swift

Genes not captured

  • This is a summary of the genes that were not totally captured by a single base, per sample. Meaning, coverage is below 1x.

  • Are these genes the same per sample?
    • No, they are not. This is the list of genes not captured in the Swift dataset, taking into account that the Anna dataset has all genes with coverage >0x.
    • H46 and H47 were samples with really low coverage, it can be expected to loose some genes when doing the capture.
Genes not captured
Parent=Cpe_R000823;
Parent=Cpe_R008410;
Parent=Cpe_R008909;

Targeted regions not captured

This is a summary of the target regions that were not captured by a single base, per sample.

Checking whether the targeted regions that were not capture are the same for all the samples. The reason why the regions matched is because the target regions file was generated a posteriori from a list of exons coming from the Chicken genome. It might be possible that:

  1. such exons are not present in these species
  2. the probes designed for such regions were not effective and, therefore, such regions were not recovered.
  3. also, taking into account that the second dataset is mapped to an outgroup, is possible that such targets were not even in the outgroup reference and so not captured.
Number of common target regions Percentage (common target/total target regions)
Within map2Anna samples 400 1.663617
Within map2Swift samples 153 1.036304
Across datasets 0 0.000000
  • This shows that there are not common target regions across datasets. Taking into account the things above, (3) does not match, because the target file of the map2Swift was match (by genes) to the map2Anna target file.
  • Makes sense that the number of targets not covered is lower in the map2Swift dataset, due to the lower number of targets in general, even though the percentages are similar (1%)
  • I cannot find a possible reason why these regions where not captured, but the fact that such targets simply did not work.

Min. X1st.Qu. Median Mean X3rd.Qu. Max.
Anna 1 95 130.5 175.9 178 2,919
Swift 1 93 124 135.6 169 525
  • Then, going through the sizes of those targets, it does not show much, mean of the sizes is ~130 bp, which is close to the size of the probes.

Off-target regions

Generation of on and off target datasets

First, is important to define what is an off-target region. For the purpose of this analysis, there are two (2) definition of an off-target region. Also, I’m assuming that the all the possible DNA captured (on/off-target) is defined by a set of scaffolds in the Calypte_anna.gene.CDS.2750.2.gff file.

There are 580 that correspond to ~ 1,089,987,593 bp. So, taking this into account:

  1. The first definition considers off-target regions as a set difference between the scaffolds and the regions in the targets file.
    1. map2Anna
      • off-targets \(=\) Scaffolds \(\setminus\) Calypte_anna.gene.CDS.2750.2.gff targets.
      • These off-targets contain 18,263 targets, corresponding to ~1,087,045,505 bp.
    2. mapt2Swift
      • off-targets \(=\) Scaffolds \(\setminus\) Chaetura_pelagica.CDS.2.gff targets.
      • These off-targets contain 14,972 targets, corresponding to ~1,087,707,387 bp.
  2. The second definition considers off-target regions as a set difference between the scaffolds and the regions in Calypte_anna.gene.CDS.2750.2.gff and the regions in Chaetura_pelagica.CDS.2.gff.
    • off-targets \(=\) scaffolds \(\setminus\) (Calypte_anna.gene.CDS.2750.2.gff targets \(\cup\) Chaetura_pelagica.CDS.2.gff targets).
    • These off-targets contain 32,606, corresponding to ~1,084,770,828

Generation of the off-target datasets starts with the identification of the off-target regions. First, obtaining the size of the scaffolds. This has been made by identifying the highest position per scaffold in Calypte_anna.gene.CDS.2750.2.gff and in Chaetura_pelagica.CDS.2.gff, getting with this the sizes of the scaffolds. Afterwards, generating a BED file with the regions (scaffold, startPosition, endPosition) and from which I subtracted both GFF target files. In this way I have the regions that were not targeted, and that are within the range of the data.

Off-target regions across samples

The idea is to identify the off-target regions, as well as the size that covers and how is the coverage distribution within this regions compared to the on-target coverage.

To obtain coverage information from the off-target regions:

for bamfile in $(find $originalsAnna -name "*.bam" ); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.anna.bed" | gzip > $HOME/anna/bedtools/cov2/${tag}.cov.gz &
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.bed" | gzip > $HOME/anna/bedtools/cov3/${tag}.cov.gz &
done
for bamfile in $(find $originalsSwift -name "*.bam"  ); do
  tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.swift.bed" | gzip > $HOME/swift/bedtools/cov2/${tag}.cov.gz &
  nohup /home/merly/src/bedtools2/bin/bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.bed" | gzip > $HOME/swift/bedtools/cov3/${tag}.cov.gz &
done
#  hist split
for tag in $(cat $HOME/anna/files/samples.txt  ); do
  zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist2/$tag.nohist.gz
  zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist2/$tag.hist.gz
  zcat $HOME/anna/bedtools/cov3/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist3/$tag.nohist.gz
  zcat $HOME/anna/bedtools/cov3/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist3/$tag.hist.gz
done
for tag in $(cat $HOME/swift/files/samples.txt  ); do
  zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist2/$tag.nohist.gz
  zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist2/$tag.hist.gz
  zcat $HOME/swift/bedtools/cov3/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist3/$tag.hist.gz
  zcat $HOME/swift/bedtools/cov3/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist3/$tag.nohist.gz
done

Breadth vs. depth of coverage

I followed what has been done for the on-target regions to get this information. Getting how much of the off-target regions was captured, and at which depth.

map2Anna map2Swift
X-axis: 0-20 X-axis: 0-100
  • For the map2Anna dataset, for both off-target sets breadth vs. depth looks the same. Same situation with the mat2Swift dataset.
  • For the map2Anna, we see that the general depth is low, 25% of the off-target regions are covered at 1x for most of the samples.
  • For the map2Swift, we see that there are 3 situations:
    • Swift sample is the one that has more coverage in general
    • Anna and H10 have a “medium” level coverage
    • Rest of the sample behaves in the same way as in the other dataset.

Pending questions

  • Anna and Swift samples are WGS mapped to the target regions?

Depth of coverage

Coverage per sample

  • There is not much difference between the different off-target files.
  • The 0x coverage for AnnaBGI and SwiftBGI in the map2Anna datasets, is because these samples were not in that mapping.
  • It is possible to see that the general coverage of the off-target regions is < 1x for all the samples. Less than 1% of the on-target coverages seen.

Coverage per target

Min. 1st. Qu. Median Mean 3rd. Qu Max. Var. Std. Dev
map2Anna | Off-target 1 0 1.518 7.4 48.29 21.89 34,600 243,841.7 493.8033
map2Anna | Off-target 2 0 0 0.4007 28.4 10.18 27,260 153,951.500672313 392.366538675653
map2Swift | Off-target 1 0 0.6185 2.794 34.51 9.069 37,860 337,580.14666756 581.016477105048
map2Swift | Off-target 2 0 0 0.1844 17.69 2.691 66,150 254,179.229772137 504.16190829151
  • Most of the targets are below 21x

ANGSD Depth calculations

This computes depth distribution for every sample and for all samples jointly. Computed wiht ANGSD [angsd version: 0.916 (htslib: 1.3.2) build(May  2 2017 16:21:49)].

angsd -bam bam.filelist -doDepth 1 -out all -doCounts 1 -maxDepth 1000

Output of ANGSD are 2 files:

  • .depthSample: This file contains nInd number of lines. Column1 is the number sites that has sequencing depth=0,Column2 is the number of sites that has sequencing depth=1 etc

  • .depthGlobal: The sequencing depth for all samples.

Relation within coverage and phylogenetic distance

I am trying to analyze if there is any correlation between the depth of coverage and the phylogenetic distance from the reference species, used for the probe generation and mapping.

Inferred phylogeny from Hummingbirds

Information of the phylogenetic reconstruction I got from the Hummingbirds Paper, this says that it was made with mitochondrial genome and nuclear gene trees using either a concatenation approach with RAxML [3] or the multi-species coalescent approach implemented in ASTRAL [4] and ASTRAL-II [5].

The phylogenies were built using subsets of the nuclear genes present the same topology between the main groups of hummingbirds with high confidence. The three subsets correspond to:

  1. the 2949 genes that were successfully captured (in a minimum of 8 species)
  2. the 1987 genes for which we could assign a high confidence swift orthologs determined in [5] (used as a root)
  3. the 741 genes within the latter that produced trees with average support above 50%.

This is the Supplementary Table S5 (from the Hummingbirds Paper), describing the subsets selected:

number of genes minimum number of sites per species (concatenated) maximum number of sites per species (concatenated) includes outgroup ASTRAL score
741 949,470 1,542,750 yes 87%
1987 1,667,071 2,783,832 yes 78%
2949 2,473,664 3,792,500

Pending questions

Is this really this dataset? weren’t we working with originally 2750 genes?

I’ll be using only the subset for the 2949 genes.

  • Cpe, outgroup reference.
  • Aan, ingroup reference.
Inferred phylogeny

Inferred phylogeny

Distance Matrix from inferred phylogeny

To obtain the distance matrix I used the tree obtained with RAxML, from the concatenation of 2949 genes. The tree gives me the branch lengths in number of substitutions per site. So, the distance calculated here, is the pairwise distance of the tips. This was done with the R package phangorn [7]

sampleColors = colorRampPalette(brewer.pal(9, "Set1"))(48)
treefile="/home/merly/git/cph-visit/coverage-analysis/capture-phylotenetic-decay/trees/RAxML_bipartitions.2011.concat"
tree=read.nexus(treefile)
distMatrix=cophenetic(tree)
dCpe=distMatrix["Cpe",]
dAan=distMatrix["Aan",]
distMatrix[upper.tri(distMatrix)]=NA
Distance matrix

Distance matrix

Depth of coverage vs. phylogenetic distance

I have done this following the figure F3A in Braggs’s paper [8], where it shows sequencing coverage as a function of genetic divergence. I have calculated the depth of coverage for both datasets and phylogenetic distance, now I have the relationship plot:

Relation

Correlations

Anna

## 
## Call:
## lm(formula = disCoverageAnna$distance ~ disCoverageAnna$median)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.022120 -0.008747 -0.003240  0.009100  0.038334 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.319e-02  2.885e-03  11.504 7.47e-15 ***
## disCoverageAnna$median -1.549e-05  2.942e-05  -0.526    0.601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01253 on 44 degrees of freedom
## Multiple R-squared:  0.006258,   Adjusted R-squared:  -0.01633 
## F-statistic: 0.2771 on 1 and 44 DF,  p-value: 0.6013

CHECK THIS OUT!

interpret this

Swift

## 
## Call:
## lm(formula = disCoverageSwift$distance ~ disCoverageSwift$median)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0072130 -0.0028632 -0.0006428  0.0020115  0.0198364 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.262e-01  1.054e-03 119.698   <2e-16 ***
## disCoverageSwift$median -3.490e-06  1.273e-05  -0.274    0.785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004584 on 44 degrees of freedom
## Multiple R-squared:  0.001705,   Adjusted R-squared:  -0.02098 
## F-statistic: 0.07515 on 1 and 44 DF,  p-value: 0.7853

CHECK THIS OUT!

interpret this


References

  • [1]: Bleiweiss R, Kirsch JAW, Matheus JC (1994) DNA-DNA Hybridization Evidence for Sub-family Structure among Hummingbirds. Auk 111(1):8–19. DOI: 10.2307/4088500
  • [2]: McGuire JA, et al. (2014) Molecular phylogenetics and the diversification of hummingbirds. Curr Biol 24(8):910–6. DOI: 10.1016/j.cub.2014.03.016
  • [3]: Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313. DOI: 10.1093/bioinformatics/btu033
  • [5]: Mirarab S, Warnow T (2015) ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31(12):i44–52. DOI: 10.1093/bioinformatics/btv234
  • [6]: Jarvis ED, et al. (2014) Whole Genome Analyses Resolve the Early Branches in the Tree of Life of Modern Birds. Science (80- ) 346(6215):1320–1331. DOI: 10.1126/science.1253451
  • [8]: Bragg, JG., Potter, S., Bi. K. and Moritz, C. (2016) Exon capture phylogenomics: efficacy across scales of divergence Molecular Ecology Resources 16, 1059–1068. DOI: 10.1111/1755-0998.12449